Introduction
Airline passenger satisfaction is a crucial metric for firms in the airline industry. Understanding the factors that contribute to customer satisfaction is essential for airlines to improve their services and compete effectively; high market saturation, as well as low profit margins, can magnify the effects of small advantages or disadvantages relative to other firms (Lutz et al., 2012; Hardee, 2023). In this research, we will analyze various factors that affect airline passenger satisfaction — provided through a survey dataset — and, ultimately, judge their suitability for a regression model predicting passenger satisfaction.
Research Proposal
Our research will first look at individual variables in the aforementioned survey dataset to examine distributions and other characteristics. Then, we will identify a regression model that may be congruent with our dataset and test assumptions associated with the model.
We will leverage a Kaggle dataset that includes surveyed passenger characteristics, flight details, and satisfaction ratings for select pre-flight and in-flight components (Klein, 2020). To ensure modeling suitability, we will conduct exploratory data analysis, taking into account variable distributions and types.
SMART Questions
With our research, we aim to make progress towards answering the following questions:
To what extent do certain surveyed passenger characteristics and flight experience components impact the likelihood that a passenger will be satisfied – rather than neutral or dissatisfied – with their trip?
How can we model the likelihood of passenger satisfaction using surveyed passenger characteristics and flight experience components in a manner that minimizes predictive bias?
To what extent can we predict the likelihood that a flight passenger will be satisfied with their experience using multiple different variable levels?
Objective
This research offers an opportunity to assess the limitations of linear regression models in predicting passenger satisfaction, specifically with regards to the categorical nature of the output in this dataset. Through exploratory data analysis (EDA), we can identify the characteristics of our data and subsequently illustrate why a linear regression model may not be suitable for this analysis. This will lay the groundwork for our future research on logistic regression.
In summary, our research will provide insights into the intricate relationship between passenger characteristics, flight experience, and satisfaction levels. We will also explore the limitations of linear regression models and prepare the foundation for a more advanced logistic regression approach in future analysis.
Dataset Variables
The dataset for our research on airline passenger satisfaction contains various variables, which can be categorized into three types: continuous, categorical, and ordinal. In this section, we’ll list and briefly explain each of these variables.
Continuous Variables
Age: This variable represents the actual age of the passengers.
Flight Distance: Flight distance is the distance covered during the journey, measured in miles.
Departure Delay in Minutes: This variable indicates the number of minutes by which a flight was delayed during departure.
Arrival Delay in Minutes: Similarly, this variable represents the number of minutes by which a flight was delayed during arrival.
Categorical Variables
Gender: Gender is a categorical variable indicating the gender of the passengers.
Customer Type: The “Customer Type” variable categorizes passengers based on their customer loyalty.
Type of Travel: This variable categorizes the purpose of the flight.
Class: “Class” indicates the travel class in the plane.
Ordinal Variables
The following variables represent satisfaction levels, which are ordinal in nature, with values ranging from 0 to 5. According to the documentation, 0 is used to encode “Not Applicable” values.
Inflight Wifi Service: Satisfaction level of the inflight wifi service.
Departure/Arrival Time Convenient: Satisfaction level of departure/arrival time convenience.
Ease of Online Booking: Satisfaction level of online booking.
Gate Location: Satisfaction level of gate location.
Food and Drink: Satisfaction level of food and drink.
Online Boarding: Satisfaction level of online boarding.
Seat Comfort: Satisfaction level of seat comfort.
Inflight Entertainment: Satisfaction level of inflight entertainment.
On-board Service: Satisfaction level of on-board service.
Leg Room Service: Satisfaction level of leg room service.
Baggage Handling: Satisfaction level of baggage handling.
Check-in Service: Satisfaction level of check-in service.
Inflight Service: Satisfaction level of inflight service.
Cleanliness: Satisfaction level of cleanliness.
Target Variable
- Satisfaction: The “Satisfaction” variable represents the airline passenger’s satisfaction level and includes two categories: “satisfied” or “neutral or dissatisfied.” This will be our primary outcome variable for analysis.
In our research, we will explore how these variables interact and contribute to passenger satisfaction levels. We will use statistical methods and modeling techniques to gain insights into the factors that lead to customer satisfaction for an airline.
Variable limitations
While the analysis and insight generation opportunities are manifold, certain fields in this dataset present challenges that limit a resulting model’s predictive validity. These include:
Data collection: this dataset was sourced from Kaggle (Klein, 2020). While some variable-related documentation is available, we are not able to discern the circumstances under which this survey was distributed. The population may have been sampled through certain methods—such as convenience sampling—that make resulting data less representative of the overall population despite the large observation count. The overall population in question also is not clear; the survey may have focused on a particular airport or region, limiting potential predictive validity in alternative settings.
Loyal/disloyal clarity: the document does not elaborate upon what counts as a “loyal” or “disloyal” customer for that field. This makes it difficult to properly interpret the effects of such a variable in a regression model. The threshold for disloyalty could potentially range from using any other airlines at all to using other airlines a majority of the time, drastically altering any potential real-world applications.
Ticket prices: ticket prices are not included in this survey, with class serving as a rough proxy; intuitively, such prices could play a major factor in passengers’ service expectations and their subsequent ratings. The lack of price ranges associated with seat class also makes it difficult to encode the three categories in a way that accurately captures the disparity.
Loading the Data
We first imported the data into R using the read.csv() function. The first few rows of the dataset are included below.
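A minimal sketch of the import step; the file name train.csv is a placeholder assumption, to be replaced with the path of the downloaded Kaggle CSV:

```r
# Read the airline satisfaction survey into a data frame.
# "train.csv" is an assumed file name -- substitute the actual path.
data <- read.csv("train.csv")

# Preview the first few rows
head(data)
```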
| X | id | Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | 5 | 3 | 5 | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18 | neutral or dissatisfied |
| 1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | 1 | 3 | 1 | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6 | neutral or dissatisfied |
| 2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | 5 | 5 | 5 | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0 | satisfied |
| 3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | 2 | 2 | 2 | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9 | neutral or dissatisfied |
| 4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | 4 | 5 | 5 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0 | satisfied |
Checking data structure and dimensions
Data structure
## 'data.frame': 103904 obs. of 25 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ id : int 70172 5047 110028 24026 119299 111157 82113 96462 79485 65725 ...
## $ Gender : chr "Male" "Male" "Female" "Female" ...
## $ Customer.Type : chr "Loyal Customer" "disloyal Customer" "Loyal Customer" "Loyal Customer" ...
## $ Age : int 13 25 26 25 61 26 47 52 41 20 ...
## $ Type.of.Travel : chr "Personal Travel" "Business travel" "Business travel" "Business travel" ...
## $ Class : chr "Eco Plus" "Business" "Business" "Business" ...
## $ Flight.Distance : int 460 235 1142 562 214 1180 1276 2035 853 1061 ...
## $ Inflight.wifi.service : int 3 3 2 2 3 3 2 4 1 3 ...
## $ Departure.Arrival.time.convenient: int 4 2 2 5 3 4 4 3 2 3 ...
## $ Ease.of.Online.booking : int 3 3 2 5 3 2 2 4 2 3 ...
## $ Gate.location : int 1 3 2 5 3 1 3 4 2 4 ...
## $ Food.and.drink : int 5 1 5 2 4 1 2 5 4 2 ...
## $ Online.boarding : int 3 3 5 2 5 2 2 5 3 3 ...
## $ Seat.comfort : int 5 1 5 2 5 1 2 5 3 3 ...
## $ Inflight.entertainment : int 5 1 5 2 3 1 2 5 1 2 ...
## $ On.board.service : int 4 1 4 2 3 3 3 5 1 2 ...
## $ Leg.room.service : int 3 5 3 5 4 4 3 5 2 3 ...
## $ Baggage.handling : int 4 3 4 3 4 4 4 5 1 4 ...
## $ Checkin.service : int 4 1 4 1 3 4 3 4 4 4 ...
## $ Inflight.service : int 5 4 4 4 3 4 5 5 1 3 ...
## $ Cleanliness : int 5 1 5 2 3 1 2 4 2 2 ...
## $ Departure.Delay.in.Minutes : int 25 1 0 11 0 0 9 4 0 0 ...
## $ Arrival.Delay.in.Minutes : num 18 6 0 9 0 0 23 0 0 0 ...
## $ satisfaction : chr "neutral or dissatisfied" "neutral or dissatisfied" "satisfied" "neutral or dissatisfied" ...
X and id: These columns are unique identifiers for each observation. X appears to be an integer row index, while id likely represents a customer or survey-response ID.
Gender: This column contains information about the gender of the passengers, with values such as “Male” and “Female.”
Customer.Type: This variable describes the customer as a “Loyal Customer” or a “disloyal Customer.”
Age: Represents the age of the passengers and is an integer variable.
Type.of.Travel: Indicates the purpose of travel with two levels, “Personal Travel” and “Business travel.”
Class: Specifies the class of travel with three levels: “Business,” “Eco,” and “Eco Plus.”
Flight.Distance: This variable contains the distance of the flight in miles as an integer.
Inflight.wifi.service, Departure.Arrival.time.convenient, and several other columns: These variables seem to represent passengers’ ratings or feedback on different aspects of their flight experience. They are integer variables with ratings ranging from 0 to 5.
Departure.Delay.in.Minutes and Arrival.Delay.in.Minutes: These columns represent the delay in minutes for departure and arrival, respectively. Departure delay is an integer, while arrival delay is numeric; this runs counter to initial expectations, since both delay columns would presumably share the same type. A likely culprit is a discrepancy in whether decimal values were used to record arrival delays.
satisfaction: This is the target variable or the outcome of interest, and it represents customer satisfaction levels with values like “neutral or dissatisfied” and “satisfied.”
Data dimensions
This is a data frame with 103904 observations (rows) and 25 variables (columns). Assuming that a robust sampling method was utilized, the large number of observations may allow us to conclude that the data is generally representative of the actual population.
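The structure and dimension checks above can be reproduced with base R; a sketch assuming the data frame is named data, as in the console output:

```r
# Variable names, types, and sample values for each column
str(data)

# Row and column counts: 103904 observations across 25 variables
dim(data)
```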
An initial description of the data
## data
##
## 25 Variables 103904 Observations
## ------------------------------------------------------------
## X
## n missing distinct Info Mean Gmd
## 103904 0 103904 1 51952 34635
## .05 .10 .25 .50 .75 .90
## 5195 10390 25976 51952 77927 93513
## .95
## 98708
##
## lowest : 0 1 2 3 4
## highest: 103899 103900 103901 103902 103903
## ------------------------------------------------------------
## id
## n missing distinct Info Mean Gmd
## 103904 0 103904 1 64924 43260
## .05 .10 .25 .50 .75 .90
## 6593 13044 32534 64856 97368 116884
## .95
## 123410
##
## lowest : 1 2 3 4 5
## highest: 129874 129875 129878 129879 129880
## ------------------------------------------------------------
## Gender
## n missing distinct
## 103904 0 2
##
## Value Female Male
## Frequency 52727 51177
## Proportion 0.507 0.493
## ------------------------------------------------------------
## Customer.Type
## n missing distinct
## 103904 0 2
##
## Value disloyal Customer Loyal Customer
## Frequency 18981 84923
## Proportion 0.183 0.817
## ------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd
## 103904 0 75 1 39.38 17.32
## .05 .10 .25 .50 .75 .90
## 14 20 27 40 51 59
## .95
## 64
##
## lowest : 7 8 9 10 11, highest: 77 78 79 80 85
## ------------------------------------------------------------
## Type.of.Travel
## n missing distinct
## 103904 0 2
##
## Value Business travel Personal Travel
## Frequency 71655 32249
## Proportion 0.69 0.31
## ------------------------------------------------------------
## Class
## n missing distinct
## 103904 0 3
##
## Value Business Eco Eco Plus
## Frequency 49665 46745 7494
## Proportion 0.478 0.450 0.072
## ------------------------------------------------------------
## Flight.Distance
## n missing distinct Info Mean Gmd
## 103904 0 3802 1 1189 1066
## .05 .10 .25 .50 .75 .90
## 175 236 414 843 1743 2750
## .95
## 3383
##
## lowest : 31 56 67 73 74, highest: 4243 4502 4817 4963 4983
## ------------------------------------------------------------
## Inflight.wifi.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.956 2.73 1.492
##
## Value 0 1 2 3 4 5
## Frequency 3103 17840 25830 25868 19794 11469
## Proportion 0.030 0.172 0.249 0.249 0.191 0.110
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Arrival.time.convenient
## n missing distinct Info Mean Gmd
## 103904 0 6 0.962 3.06 1.716
##
## Value 0 1 2 3 4 5
## Frequency 5300 15498 17191 17966 25546 22403
## Proportion 0.051 0.149 0.165 0.173 0.246 0.216
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Ease.of.Online.booking
## n missing distinct Info Mean Gmd
## 103904 0 6 0.961 2.757 1.578
##
## Value 0 1 2 3 4 5
## Frequency 4487 17525 24021 24449 19571 13851
## Proportion 0.043 0.169 0.231 0.235 0.188 0.133
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Gate.location
## n missing distinct Info Mean Gmd
## 103904 0 6 0.952 2.977 1.437
##
## Value 0 1 2 3 4 5
## Frequency 1 17562 19459 28577 24426 13879
## Proportion 0.000 0.169 0.187 0.275 0.235 0.134
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Food.and.drink
## n missing distinct Info Mean Gmd
## 103904 0 6 0.956 3.202 1.499
##
## Value 0 1 2 3 4 5
## Frequency 107 12837 21988 22300 24359 22313
## Proportion 0.001 0.124 0.212 0.215 0.234 0.215
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Online.boarding
## n missing distinct Info Mean Gmd
## 103904 0 6 0.951 3.25 1.501
##
## Value 0 1 2 3 4 5
## Frequency 2428 10692 17505 21804 30762 20713
## Proportion 0.023 0.103 0.168 0.210 0.296 0.199
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Seat.comfort
## n missing distinct Info Mean Gmd
## 103904 0 6 0.945 3.439 1.462
##
## Value 0 1 2 3 4 5
## Frequency 1 12075 14897 18696 31765 26470
## Proportion 0.000 0.116 0.143 0.180 0.306 0.255
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.entertainment
## n missing distinct Info Mean Gmd
## 103904 0 6 0.95 3.358 1.49
##
## Value 0 1 2 3 4 5
## Frequency 14 12478 17637 19139 29423 25213
## Proportion 0.000 0.120 0.170 0.184 0.283 0.243
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## On.board.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.947 3.382 1.433
##
## Value 0 1 2 3 4 5
## Frequency 3 11872 14681 22833 30867 23648
## Proportion 0.000 0.114 0.141 0.220 0.297 0.228
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Leg.room.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.95 3.351 1.471
##
## Value 0 1 2 3 4 5
## Frequency 472 10353 19525 20098 28789 24667
## Proportion 0.005 0.100 0.188 0.193 0.277 0.237
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Baggage.handling
## n missing distinct Info Mean Gmd
## 103904 0 5 0.926 3.632 1.282
##
## Value 1 2 3 4 5
## Frequency 7237 11521 20632 37383 27131
## Proportion 0.070 0.111 0.199 0.360 0.261
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Checkin.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.946 3.304 1.408
##
## Value 0 1 2 3 4 5
## Frequency 1 12890 12893 28446 29055 20619
## Proportion 0.000 0.124 0.124 0.274 0.280 0.198
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Inflight.service
## n missing distinct Info Mean Gmd
## 103904 0 6 0.924 3.64 1.274
##
## Value 0 1 2 3 4 5
## Frequency 3 7084 11457 20299 37945 27116
## Proportion 0.000 0.068 0.110 0.195 0.365 0.261
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Cleanliness
## n missing distinct Info Mean Gmd
## 103904 0 6 0.953 3.286 1.471
##
## Value 0 1 2 3 4 5
## Frequency 12 13318 16132 24574 27179 22689
## Proportion 0.000 0.128 0.155 0.237 0.262 0.218
##
## For the frequency table, variable is rounded to the nearest 0
## ------------------------------------------------------------
## Departure.Delay.in.Minutes
## n missing distinct Info Mean Gmd
## 103904 0 446 0.82 14.82 24.68
## .05 .10 .25 .50 .75 .90
## 0 0 0 0 12 44
## .95
## 78
##
## lowest : 0 1 2 3 4, highest: 933 978 1017 1305 1592
## ------------------------------------------------------------
## Arrival.Delay.in.Minutes
## n missing distinct Info Mean Gmd
## 103594 310 455 0.823 15.18 25.15
## .05 .10 .25 .50 .75 .90
## 0 0 0 0 13 44
## .95
## 79
##
## lowest : 0 1 2 3 4, highest: 952 970 1011 1280 1584
## ------------------------------------------------------------
## satisfaction
## n missing distinct
## 103904 0 2
##
## Value neutral or dissatisfied satisfied
## Frequency 58879 45025
## Proportion 0.567 0.433
## ------------------------------------------------------------
- Variables X and id:
- Variable ‘X’ is an integer index ranging from 0 to 103903 with no missing values.
- Variable ‘id’ represents customer IDs and is also an integer, ranging from 1 to 129880 with no missing values.
- Gender:
- There are two distinct values, ‘Female’ and ‘Male,’ with roughly equal proportions of female (50.7%) and male (49.3%) passengers.
- Customer Type:
- Two distinct types of customers are present: ‘disloyal Customer’ and ‘Loyal Customer.’ ‘Loyal Customer’ is the dominant type, accounting for approximately 81.7% of passengers.
- Age:
- The age variable ranges from 7 to 85 with a mean age of approximately 39.38. 50% of the respondents’ ages fall between 27 and 51.
- Type of Travel:
- There are two types of travel: ‘Business travel’ (69.0%) and ‘Personal Travel’ (31.0%). Business travel is the more common type by far.
- Class:
- Three distinct classes are available: ‘Business,’ ‘Eco,’ and ‘Eco Plus.’
- ‘Business’ class is the most popular (47.8%), followed by ‘Eco’ (45.0%) and ‘Eco Plus’ (7.2%).
- Flight Distance:
- The mean flight distance is approximately 1189 miles, with values ranging from 31 to 4983 miles (5th and 95th percentiles of 175 and 3383 miles, respectively).
- Inflight Wifi Service, Departure Arrival Time Convenient, Ease of Online Booking, Gate Location, Food and Drink, Online Boarding, Seat Comfort, Inflight Entertainment, On-Board Service, Legroom Service, Baggage Handling, Check-In Service, Inflight Service, and Cleanliness:
- These variables represent passengers’ ratings on a scale from 0 to 5 for various aspects of their flight experience.
- The mean ratings for each of these variables fall between 2.73 and 3.64.
- 4 appears to be the most commonly selected option for most individual ratings.
- Departure Delay in Minutes:
- The majority of flights have no departure delay (mean delay of 14.82 minutes).
- Delays range from 0 to 1592 minutes, with 95% of flights delayed 78 minutes or less.
- Arrival Delay in Minutes:
- Arrival delays are similar to departure delays, with the majority having no delay (mean delay of 15.18 minutes).
- Delays range from 0 to 1584 minutes, with 95% of flights delayed 79 minutes or less.
- Satisfaction:
- There are two categories of satisfaction: ‘neutral or dissatisfied’ (56.7%) and ‘satisfied’ (43.3%).
- Overall, more passengers appear to be ‘neutral or dissatisfied’ with their flight experience.
Data Pre-processing
Duplicate values
The dataset contains no duplicate rows.
Missing Values
The following table shows the NA values in our dataset:

| X | id | Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 310 | 0 |
We elected to replace these 310 NA values in arrival delays with the median delay; this method was used over other potential replacement options, such as the average, due to the skewed distribution of values detailed later on.
The table below demonstrates that all missing values have been replaced; the “X” and “id” fields for index number and survey ID are also removed from the data frame due to their limited relevance for modeling.
Responses for the ratings variables are coded as values from 1 to 5. However, some responses include 0; as noted earlier, this indicates that the question was not applicable. Respondents who selected this option for any of the ratings variables are filtered out to ensure that all of the individual ratings are relevant for all observations. While alternatives exist, such as replacement, the large number of initial observations limited our concerns over a potential loss in predictive validity.
| Gender | Customer.Type | Age | Type.of.Travel | Class | Flight.Distance | Inflight.wifi.service | Departure.Arrival.time.convenient | Ease.of.Online.booking | Gate.location | Food.and.drink | Online.boarding | Seat.comfort | Inflight.entertainment | On.board.service | Leg.room.service | Baggage.handling | Checkin.service | Inflight.service | Cleanliness | Departure.Delay.in.Minutes | Arrival.Delay.in.Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
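The pre-processing steps described above can be sketched in base R, assuming the data frame is named data and the column names follow the read.csv() output shown earlier:

```r
# 1. Confirm there are no duplicate rows
sum(duplicated(data))

# 2. Replace the 310 missing arrival delays with the median delay
med_delay <- median(data$Arrival.Delay.in.Minutes, na.rm = TRUE)
data$Arrival.Delay.in.Minutes[is.na(data$Arrival.Delay.in.Minutes)] <- med_delay

# 3. Drop the index and survey-ID columns
data <- data[, !(names(data) %in% c("X", "id"))]

# 4. Remove rows where any rating is 0 ("Not Applicable")
rating_cols <- c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                 "Ease.of.Online.booking", "Gate.location", "Food.and.drink",
                 "Online.boarding", "Seat.comfort", "Inflight.entertainment",
                 "On.board.service", "Leg.room.service", "Baggage.handling",
                 "Checkin.service", "Inflight.service", "Cleanliness")
data <- data[rowSums(data[rating_cols] == 0) == 0, ]

# Confirm no missing values remain
colSums(is.na(data))
```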
Summary Statistics
The following output features summary statistics for the continuous variables:
summary_stats_numeric
## Age Flight.Distance Departure.Delay.in.Minutes
## Min. : 7.0 Min. : 31 Min. : 0
## 1st Qu.:28.0 1st Qu.: 438 1st Qu.: 0
## Median :40.0 Median : 867 Median : 0
## Mean :39.8 Mean :1222 Mean : 15
## 3rd Qu.:51.0 3rd Qu.:1773 3rd Qu.: 13
## Max. :85.0 Max. :4983 Max. :1592
## Arrival.Delay.in.Minutes
## Min. : 0
## 1st Qu.: 0
## Median : 0
## Mean : 15
## 3rd Qu.: 13
## Max. :1584
The following output features summary statistics for the categorical/ordinal variables:
summary_stats_categorical
## Gender_n Gender_n_distinct Gender_top_freq
## 1 95704 2 Female
## Customer.Type_n Customer.Type_n_distinct
## 1 95704 2
## Customer.Type_top_freq Type.of.Travel_n
## 1 Loyal Customer 95704
## Type.of.Travel_n_distinct Type.of.Travel_top_freq Class_n
## 1 2 Business travel 95704
## Class_n_distinct Class_top_freq Inflight.wifi.service_n
## 1 3 Business 95704
## Inflight.wifi.service_n_distinct
## 1 5
## Inflight.wifi.service_top_freq
## 1 3
## Departure.Arrival.time.convenient_n
## 1 95704
## Departure.Arrival.time.convenient_n_distinct
## 1 5
## Departure.Arrival.time.convenient_top_freq
## 1 4
## Ease.of.Online.booking_n
## 1 95704
## Ease.of.Online.booking_n_distinct
## 1 5
## Ease.of.Online.booking_top_freq Gate.location_n
## 1 3 95704
## Gate.location_n_distinct Gate.location_top_freq
## 1 5 3
## Food.and.drink_n Food.and.drink_n_distinct
## 1 95704 5
## Food.and.drink_top_freq Online.boarding_n
## 1 4 95704
## Online.boarding_n_distinct Online.boarding_top_freq
## 1 5 4
## Seat.comfort_n Seat.comfort_n_distinct
## 1 95704 5
## Seat.comfort_top_freq Inflight.entertainment_n
## 1 4 95704
## Inflight.entertainment_n_distinct
## 1 5
## Inflight.entertainment_top_freq On.board.service_n
## 1 4 95704
## On.board.service_n_distinct On.board.service_top_freq
## 1 5 4
## Leg.room.service_n Leg.room.service_n_distinct
## 1 95704 5
## Leg.room.service_top_freq Baggage.handling_n
## 1 4 95704
## Baggage.handling_n_distinct Baggage.handling_top_freq
## 1 5 4
## Checkin.service_n Checkin.service_n_distinct
## 1 95704 5
## Checkin.service_top_freq Inflight.service_n
## 1 4 95704
## Inflight.service_n_distinct Inflight.service_top_freq
## 1 5 4
## Cleanliness_n Cleanliness_n_distinct Cleanliness_top_freq
## 1 95704 5 4
## satisfaction_n satisfaction_n_distinct
## 1 95704 2
## satisfaction_top_freq
## 1 neutral or dissatisfied
Examining variable distributions
Frequency distributions for categorical variables
The plots above provide visual representations for the summary statistics detailed earlier. While none initially appear to be highly correlated, we intend to confirm this using variance inflation factor (VIF) analysis at a later time once our model is fleshed out (“vif: Variance Inflation Factors”, n.d.).
Given a robust sampling method, we can safely assume that these distributions (including the highly skewed ones) are representative of the overall population.
Looking at the distribution of class, Eco Plus has a significantly lower observation frequency than the other two. In addition, as noted earlier, the magnitudes of increments between Eco, Eco Plus, and Business are not clear; some transformation may be required later to ensure modeling suitability.
Frequency distributions for continuous variables
From the graphs above, flight distance as well as both delay variables have a strongly right-skewed distribution. This makes sense intuitively; we would expect most flights to have minimal to no delays, and shorter flights are likely more frequent.
Age is the only variable that somewhat approximates a normal distribution (although that cannot be safely assumed); the current graph appears to be bimodal to a degree, with a small peak around 20-25 and another peak roughly around 35-50.
Depending on the type of regression that is ultimately selected, some of these variables may require aggressive transformations to better approximate normal distributions.
Frequency distributions for ordinal variables (Ratings)
Departure Arrival time convenient, Food and Drinks, Online boarding, Seat comfort, Inflight Entertainment, On board service, Leg room service, Baggage handling, Checkin service, Inflight service and Cleanliness all have a mode value of 4. Inflight wifi service, Gate location and Ease of online booking all have a mode value of 3. Many of the distributions for individual ratings variables look quite similar, raising multicollinearity concerns that will be addressed later.
Distribution of continuous variable features by satisfaction - KDE (Kernel Density Estimation)
Observations
Age: Middle-aged passengers tend to exhibit higher levels of satisfaction compared to both younger and older age groups, peaking around 40-50 years of age. Meanwhile, the distribution of neutral/dissatisfied passengers peaks noticeably earlier. If age is proven to be a significant factor, this could be utilized to engage in age-targeted improvements.
Flight Distance: Passengers traveling shorter distances appear to be more inclined towards neutrality or dissatisfaction compared to those embarking on longer journeys. This insight suggests that there might be unique challenges or aspects of shorter flights that influence passenger contentment and warrant further investigation.
Arrival/Departure Delays: It is difficult to discern any meaningful differences between passengers that were satisfied or neutral/dissatisfied based on arrival or departure delay durations using this method. To expand upon these visuals—potentially revealing more significant observations—we utilized a scatter plot.
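A sketch of how one of these density plots might be drawn, assuming the ggplot2 package (the specific aesthetics are our own choices, not prescribed by the analysis):

```r
library(ggplot2)

# Kernel density of passenger age, split by the satisfaction outcome;
# alpha makes the overlapping densities readable
ggplot(data, aes(x = Age, fill = satisfaction)) +
  geom_density(alpha = 0.4) +
  labs(title = "Age distribution by satisfaction", y = "Density")
```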
Visualizing the relationship between Arrival and Departure delays colored by satisfaction.
This graph also indicates that arrival and departure delays follow a roughly similar linear trajectory, potentially foreshadowing high correlation between these fields.
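A sketch of the delay scatter plot, again assuming ggplot2:

```r
library(ggplot2)

# Departure vs. arrival delay, colored by satisfaction; transparency
# reduces overplotting in the dense region near the origin
ggplot(data, aes(x = Departure.Delay.in.Minutes,
                 y = Arrival.Delay.in.Minutes,
                 color = satisfaction)) +
  geom_point(alpha = 0.3)
```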
Multicollinearity Testing
One of the essential steps in data analysis is assessing multicollinearity among independent variables. Multicollinearity occurs when predictor variables are highly correlated with each other, which can impact the reliability of regression models.
Correlation Matrices
To begin examining fields with respect to multicollinearity, we used two correlation matrices:
Continuous variables
Ratings variables
Continuous Variable Correlations
## Age Flight.Distance
## Min. :-0.016 Min. :-0.004
## 1st Qu.:-0.014 1st Qu.:-0.001
## Median : 0.035 Median : 0.042
## Mean : 0.264 Mean : 0.270
## 3rd Qu.: 0.312 3rd Qu.: 0.312
## Max. : 1.000 Max. : 1.000
## Departure.Delay.in.Minutes Arrival.Delay.in.Minutes
## Min. :-0.013 Min. :-0.016
## 1st Qu.:-0.003 1st Qu.:-0.007
## Median : 0.480 Median : 0.478
## Mean : 0.487 Mean : 0.485
## 3rd Qu.: 0.970 3rd Qu.: 0.970
## Max. : 1.000 Max. : 1.000
As observed earlier, arrival and departure delays appear to be highly correlated; certain steps, such as removing one of the two or calculating an average delay variable, would likely be necessary for use in a predictive model.
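The continuous-variable correlations can be computed with cor(); a minimal sketch assuming the cleaned data frame data:

```r
# Pearson correlations among the four continuous variables
cont_vars <- c("Age", "Flight.Distance",
               "Departure.Delay.in.Minutes", "Arrival.Delay.in.Minutes")
round(cor(data[cont_vars]), 3)
```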
Ratings Variable Correlations
Outside of continuous variables, many of the ratings appear to share similar frequency distributions based on the graphs displayed earlier, sparking significant multicollinearity concerns. Our next step to evaluate these potential relationships was to create another correlation matrix.
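A sketch of how the ratings correlation matrix might be generated, again assuming base R's cor():

```r
# Correlations among the fourteen 1-5 ratings variables
rating_cols <- c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                 "Ease.of.Online.booking", "Gate.location", "Food.and.drink",
                 "Online.boarding", "Seat.comfort", "Inflight.entertainment",
                 "On.board.service", "Leg.room.service", "Baggage.handling",
                 "Checkin.service", "Inflight.service", "Cleanliness")
round(cor(data[rating_cols]), 2)
```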
We can see from the matrix that certain ratings variables have strong positive correlations with each other. If these are included in the model without adjustments, our model may suffer a loss in reliability.
In order to avoid this issue, we elected to combine ratings variables into two groups—based on the degree of correlation—and utilize average ratings from these two groups as model inputs.
| Ratings Group 1: Pre-Flight & Wi-Fi | Ratings Group 2: In-Flight & Baggage |
|---|---|
| In-Flight Wifi Service | Food and Drink |
| Departure / Arrival Time | Seat Comfort |
| Ease of Online Booking | In-Flight Entertainment |
| Gate Location | Onboard Service |
| Online Boarding | Leg Room Service |
| Baggage Handling | |
| Check-In Service | |
| In-Flight Service | |
| Cleanliness | |
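The two group averages can be computed with rowMeans, mirroring the test-data aggregation code later in this report (a sketch; column names are assumed from the dataset):

```r
# Group 1: pre-flight and wi-fi related ratings
ratings_group1 <- data[, c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
                           "Ease.of.Online.booking", "Gate.location", "Online.boarding")]
data$Pre_Flight_and_WiFi_Ratings <- rowMeans(ratings_group1, na.rm = TRUE)
# Group 2: in-flight and baggage related ratings
ratings_group2 <- data[, c("Food.and.drink", "Seat.comfort", "Inflight.entertainment",
                           "On.board.service", "Leg.room.service", "Baggage.handling",
                           "Checkin.service", "Inflight.service", "Cleanliness")]
data$In_Flight_and_Baggage_Ratings <- rowMeans(ratings_group2, na.rm = TRUE)
summary(data[c("Pre_Flight_and_WiFi_Ratings", "In_Flight_and_Baggage_Ratings")])
```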
## Pre_Flight_and_WiFi_Ratings In_Flight_and_Baggage_Ratings
## Min. :1.00 Min. :1.11
## 1st Qu.:2.40 1st Qu.:2.78
## Median :3.00 Median :3.44
## Mean :3.04 Mean :3.41
## 3rd Qu.:3.80 3rd Qu.:4.00
## Max. :5.00 Max. :5.00
Probability and standard OLS estimates
Before engaging in further analysis, we first identified that satisfaction—as a categorical/binary variable—poses a fundamental interpretation issue under a standard linear model: unlike our satisfaction variable, the linear model's predictions are not bounded between 0 and 1. Under certain inputs, the linear model predicts unattainable values outside the satisfied and neutral/dissatisfied encodings (1 and 0 respectively), and key assumptions of linearity and homoskedasticity are violated.
Despite this restriction, linear probability models remain in widespread use, particularly among social scientists, making this a potentially fruitful avenue for a predictive model (Allison, 2015). This largely stems from ease of interpretation and generation; unlike logit (to be discussed later), this directly predicts changes in probability rather than odds ratios, is easier to run, and approximates logit for the 0.2-0.8 probability range in most cases (Allison, 2020). We generated a linear model and used a t-test with robust standard errors to account for violated homoskedasticity assumptions.
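The model and the robust t-test below can be generated along these lines (a sketch; we assume the lmtest and sandwich packages, and the specific HC covariance type is an assumption):

```r
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC(): heteroskedasticity-consistent covariance estimators

linear_model <- lm(satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
                     Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings +
                     In_Flight_and_Baggage_Ratings, data = data)
summary(linear_model)
# t-test of coefficients with robust standard errors
coeftest(linear_model, vcov = vcovHC(linear_model, type = "HC1"))
```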
##
## Call:
## lm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.076 -0.223 0.005 0.198 1.426
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.31e+00 6.57e-03 -198.79
## Gender -3.81e-04 2.14e-03 -0.18
## Customer.Type 3.57e-01 3.40e-03 105.08
## Age 1.87e-04 7.44e-05 2.51
## Type.of.Travel 4.35e-01 3.08e-03 140.99
## Class 1.25e-01 2.96e-03 42.30
## Flight.Distance 5.88e-06 1.24e-06 4.74
## Pre_Flight_and_WiFi_Ratings 9.07e-02 1.18e-03 76.63
## In_Flight_and_Baggage_Ratings 2.29e-01 1.46e-03 157.09
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Gender 0.858
## Customer.Type < 2e-16 ***
## Age 0.012 *
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 2.1e-06 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.33 on 95695 degrees of freedom
## Multiple R-squared: 0.554, Adjusted R-squared: 0.554
## F-statistic: 1.48e+04 on 8 and 95695 DF, p-value: <2e-16
##
## t test of coefficients:
##
## Estimate Std. Error t value
## (Intercept) -1.31e+00 5.69e-03 -229.60
## Gender -3.81e-04 2.14e-03 -0.18
## Customer.Type 3.57e-01 3.89e-03 91.72
## Age 1.87e-04 7.60e-05 2.46
## Type.of.Travel 4.35e-01 3.40e-03 127.92
## Class 1.25e-01 3.36e-03 37.28
## Flight.Distance 5.88e-06 1.22e-06 4.81
## Pre_Flight_and_WiFi_Ratings 9.07e-02 1.26e-03 72.12
## In_Flight_and_Baggage_Ratings 2.29e-01 1.52e-03 150.13
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Gender 0.859
## Customer.Type < 2e-16 ***
## Age 0.014 *
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 1.5e-06 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on our linear model, all inputs apart from gender have statistically significant impacts on satisfaction likelihood at the 0.05 level, though age is only marginally significant (p ≈ 0.01). As mentioned earlier, one major advantage of the linear model is that coefficients can be easily interpreted. For instance, loyal customers display a 0.357 (35.7 percentage point) increase in predicted satisfaction probability relative to others. In a similar vein, the model predicts a 43.5 percentage point higher satisfaction probability for passengers traveling for business relative to others. For the non-binary aggregated ratings, a 1-point increase corresponds to 9.07 and 22.9 percentage point predicted satisfaction probability increases for the pre-flight and in-flight groups respectively.
However, to confirm that the linear model is indeed a practically valuable predictor, we can’t rely solely on the dataset used for training; our source provides a second testing dataset to which we can apply the same cleaning/encoding steps and then our model. Since gender is not significant and age contributes little, we elected to remove both prior to this step (marking this as a “v2” model). Using a confusion matrix, we determined that the v2 model’s “accuracy”—the proportion of correctly predicted satisfaction values out of all respondents—is approximately 86.5% for the testing dataset. Based on this information, we can conclude that the linear model is a reasonably good predictor that isn’t overfitting the training data.
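The pared-down model, referenced as linear_model_v2 in the prediction step later, can be defined as follows (a sketch matching the summary output that follows):

```r
# "v2" linear model with gender and age removed
linear_model_v2 <- lm(satisfaction ~ Customer.Type + Type.of.Travel + Class +
                        Flight.Distance + Pre_Flight_and_WiFi_Ratings +
                        In_Flight_and_Baggage_Ratings, data = data)
summary(linear_model_v2)
```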
##
## Call:
## lm(formula = satisfaction ~ Customer.Type + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.076 -0.223 0.005 0.198 1.425
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1.30e+00 6.22e-03 -209.08
## Customer.Type 3.59e-01 3.28e-03 109.41
## Type.of.Travel 4.36e-01 3.07e-03 141.88
## Class 1.25e-01 2.95e-03 42.45
## Flight.Distance 5.77e-06 1.24e-06 4.66
## Pre_Flight_and_WiFi_Ratings 9.07e-02 1.18e-03 76.70
## In_Flight_and_Baggage_Ratings 2.29e-01 1.46e-03 157.15
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Customer.Type < 2e-16 ***
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Flight.Distance 3.2e-06 ***
## Pre_Flight_and_WiFi_Ratings < 2e-16 ***
## In_Flight_and_Baggage_Ratings < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.33 on 95697 degrees of freedom
## Multiple R-squared: 0.554, Adjusted R-squared: 0.554
## F-statistic: 1.98e+04 on 6 and 95697 DF, p-value: <2e-16
# Check for missing values
missing_data <- data_test %>%
summarise_all(~ sum(is.na(.)))
# Check for duplicated rows
duplicate_rows <- data_test %>%
summarise(n_duplicates = sum(duplicated(.)))
# Get unnecessary columns
drop <- c("X","id")
# Drop column names specified in vector
data_test <- data_test[,!(names(data_test) %in% drop)]
# Select ratings columns
selected_columns <- 7:20
# Check if any ratings include zeros (representing N/A)
has_zeros <- apply(data_test[selected_columns], 1, function(row) any(row == 0))
# Remove rows with zeros in the selected columns
data_test <- data_test[!has_zeros, ]
# Impute remaining NA arrival delay values with the median
data_test$Arrival.Delay.in.Minutes[is.na(data_test$Arrival.Delay.in.Minutes)] <- median(data_test$Arrival.Delay.in.Minutes, na.rm = TRUE)
missing_data <- data_test %>%
summarise_all(~ sum(is.na(.)))
# Repeat encoding steps
data_test$satisfaction <- ifelse(data_test$satisfaction == "satisfied", 1, 0)
data_test$Gender <- ifelse(data_test$Gender == "Male", 1, 0)
data_test$Customer.Type <- ifelse(data_test$Customer.Type == "Loyal Customer", 1, 0)
data_test$Type.of.Travel <- ifelse(data_test$Type.of.Travel == "Business travel", 1, 0)
data_test$Class <- ifelse(data_test$Class %in% c("Eco", "Eco Plus"), 0,
ifelse(data_test$Class == "Business", 1, NA))
# Repeat ratings aggregation steps
# Select columns for Group1
# ratings_group1_test <- select(data_test, Inflight.wifi.service, Departure.Arrival.time.convenient, Ease.of.Online.booking, Gate.location, Online.boarding)
ratings_group1_test <- data_test[, c("Inflight.wifi.service", "Departure.Arrival.time.convenient",
"Ease.of.Online.booking", "Gate.location", "Online.boarding")]
# Calculate the average for Group1
data_test$Pre_Flight_and_WiFi_Ratings <- rowMeans(ratings_group1_test, na.rm = TRUE)
# Select columns for Group2
# ratings_group2_test <- select(data_test, Food.and.drink, Seat.comfort, Inflight.entertainment, On.board.service, Leg.room.service, Baggage.handling, Checkin.service, Inflight.service, Cleanliness)
ratings_group2_test <- data_test[, c("Food.and.drink", "Seat.comfort", "Inflight.entertainment",
"On.board.service", "Leg.room.service", "Baggage.handling",
"Checkin.service", "Inflight.service", "Cleanliness")]
# Calculate the average for Group2
data_test$In_Flight_and_Baggage_Ratings <- rowMeans(ratings_group2_test, na.rm = TRUE)
data_ratings_combined_test <- data_test[c("Pre_Flight_and_WiFi_Ratings","In_Flight_and_Baggage_Ratings")]
data_test$predicted_probabilities_linear <- predict(linear_model_v2, newdata = data_test)
data_test$predicted_outcome_linear <- ifelse(data_test$predicted_probabilities_linear > 0.5, 1, 0)
confusion_matrix <- table(data_test$satisfaction, data_test$predicted_outcome_linear)
print(confusion_matrix)
##
## 0 1
## 0 11939 1651
## 1 1561 8712
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 3)))
## [1] "Accuracy: 0.865"
However, it is not yet clear that a linear model would be the best predictor available. Logistic regression, which predicts the log odds of satisfaction, is the dominant approach for modeling binary variables (Allison, 2015). Logistic regression models rely on different assumptions than linear models, significantly altering the necessary EDA steps. Rather than a linear relationship between parameters and the dependent variable, logistic regression assumes a linear relationship between parameters and the log odds. Independence of errors and a lack of multicollinearity remain assumptions for both linear and logistic models. Neither homoskedasticity nor normally distributed residuals is required under logistic regression (“Assumptions of Logistic Regression”, n.d.).
Odds represent the number of favorable outcomes divided by the number of unfavorable outcomes. Put differently, if “p” represents the probability of a favorable outcome, Odds = p/(1-p). Log odds take the natural log of the odds, expressed as ln(p/(1-p)) (Agarwal, 2019). We used visual tests to examine whether this assumption holds for the continuous variables. While it is not sensible to compute log odds for individual data points, we grouped continuous variables into discrete buckets—calculating the average log odds for each—to examine whether they might satisfy this assumption.
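This bucketing approach can be sketched as follows for one variable (flight distance is used as an illustration; the bucket count and plotting details are assumptions, not the exact code used):

```r
# Bucket flight distance and compute the empirical log odds of satisfaction
# within each bucket; a roughly linear trend across buckets is consistent
# with the logit linearity assumption
buckets <- cut(data$Flight.Distance, breaks = 20)
p <- tapply(data$satisfaction, buckets, mean)  # satisfaction rate per bucket
log_odds <- log(p / (1 - p))                   # infinite if a bucket's rate is exactly 0 or 1
plot(seq_along(log_odds), log_odds,
     xlab = "Flight distance bucket", ylab = "Average log odds of satisfaction")
```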
Only flight distance, along with in-flight and baggage ratings, displayed roughly linear relationships with the log odds of satisfaction in our testing. Age appeared to have a parabolic relationship, peaking in the middle, indicating that an aggressive transformation may be necessary to reach a linear relationship. Meanwhile, log odds for both delay statistics quickly dispersed in both directions as delay durations increased (likely in part due to the limited frequency of higher durations), making it difficult to conclude with certainty that a linear relationship exists. Pre-flight and wi-fi ratings appear to have a significantly looser connection relative to in-flight ratings, with a potential dip in log odds at average ratings.
Testing Linearity with log odds
Following visual testing, we generated a logit model in order to examine potential differences relative to the prior linear model. Rather than starting with a pared-down variable list, we returned to an expanded variable list to see if there were any distinctions in what the models deemed statistically significant. This proved to be informative; alongside gender and age, flight distance also failed to reach the threshold for statistical significance.
logit_model = glm(satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel + Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings, data = data, family = "binomial")
summary(logit_model)
##
## Call:
## glm(formula = satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
## Class + Flight.Distance + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.47e+01 9.86e-02 -148.85
## Gender 5.79e-03 2.06e-02 0.28
## Customer.Type 2.48e+00 3.19e-02 77.57
## Age 7.10e-04 7.42e-04 0.96
## Type.of.Travel 3.33e+00 3.24e-02 102.75
## Class 8.32e-01 2.56e-02 32.53
## Flight.Distance 1.45e-05 1.18e-05 1.23
## Pre_Flight_and_WiFi_Ratings 8.30e-01 1.23e-02 67.58
## In_Flight_and_Baggage_Ratings 1.96e+00 1.67e-02 116.80
## Pr(>|z|)
## (Intercept) <2e-16 ***
## Gender 0.78
## Customer.Type <2e-16 ***
## Age 0.34
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Flight.Distance 0.22
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 61179 on 95695 degrees of freedom
## AIC: 61197
##
## Number of Fisher Scoring iterations: 6
In order to compare this with the linear model, we generated another confusion matrix based on the testing data. In a similar fashion to the linear model, we created a “v2” model removing statistically insignificant inputs. The accuracy results were better than those of the linear model, but only slightly; it isn’t clear whether this marginal improvement would hold true given further testing with different survey data. The calculated McFadden pseudo-R^2 falls above 0.5.
logit_model_v2 = glm(satisfaction ~ Customer.Type + Type.of.Travel + Class + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings, data = data, family = "binomial")
summary(logit_model_v2)
##
## Call:
## glm(formula = satisfaction ~ Customer.Type + Type.of.Travel +
## Class + Pre_Flight_and_WiFi_Ratings + In_Flight_and_Baggage_Ratings,
## family = "binomial", data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -14.6554 0.0965 -151.9
## Customer.Type 2.4940 0.0298 83.7
## Type.of.Travel 3.3316 0.0321 103.8
## Class 0.8450 0.0236 35.9
## Pre_Flight_and_WiFi_Ratings 0.8302 0.0123 67.6
## In_Flight_and_Baggage_Ratings 1.9558 0.0167 116.9
## Pr(>|z|)
## (Intercept) <2e-16 ***
## Customer.Type <2e-16 ***
## Type.of.Travel <2e-16 ***
## Class <2e-16 ***
## Pre_Flight_and_WiFi_Ratings <2e-16 ***
## In_Flight_and_Baggage_Ratings <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 61181 on 95698 degrees of freedom
## AIC: 61193
##
## Number of Fisher Scoring iterations: 6
# type = "response" yields predicted probabilities (rather than log odds), matching the 0.5 threshold below
data_test$predicted_probabilities_logit <- predict(logit_model_v2, newdata = data_test, type = "response")
data_test$predicted_outcome_logit <- ifelse(data_test$predicted_probabilities_logit > 0.5, 1, 0)
confusion_matrix <- table(data_test$satisfaction, data_test$predicted_outcome_logit)
print(confusion_matrix)
##
## 0 1
## 0 12635 955
## 1 2189 8084
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(paste("Accuracy:", round(accuracy, 3)))
## [1] "Accuracy: 0.868"
logit_model_null <- glm(satisfaction ~ 1, data = data, family = "binomial")
mcFadden <- 1 - logLik(logit_model_v2)/logLik(logit_model_null)
print(paste("McFadden R^2:", round(mcFadden,3)))
## [1] "McFadden R^2: 0.531"
Logistic Regression
Logistic regression is preferable to linear regression for binary or categorical outcomes, as it models probabilities bounded between 0 and 1. It handles non-linear relationships and provides odds ratios, making it suitable for risk assessment in fields like medicine. Logistic regression is also more robust to outliers and heteroscedasticity than linear regression, which assumes a continuous and linear relationship between variables.
log_model <- glm(satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
Ease.of.Online.booking + Online.boarding + Seat.comfort +
Inflight.entertainment + On.board.service + Leg.room.service +
Baggage.handling + Checkin.service + Inflight.service +
Cleanliness + Arrival.Delay.in.Minutes,
family = binomial(), data = data)
summary(log_model)
##
## Call:
## glm(formula = satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
## Ease.of.Online.booking + Online.boarding + Seat.comfort +
## Inflight.entertainment + On.board.service + Leg.room.service +
## Baggage.handling + Checkin.service + Inflight.service + Cleanliness +
## Arrival.Delay.in.Minutes, family = binomial(), data = data)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -1.34e+01 9.93e-02 -134.91
## Age 1.20e-02 7.83e-04 15.33
## Type.of.Travel 2.35e+00 3.28e-02 71.69
## Class 1.26e+00 2.66e-02 47.32
## Inflight.wifi.service 6.25e-01 1.29e-02 48.35
## Ease.of.Online.booking -4.75e-02 1.11e-02 -4.27
## Online.boarding 1.03e+00 1.22e-02 84.35
## Seat.comfort 2.83e-02 1.26e-02 2.25
## Inflight.entertainment 3.12e-01 1.48e-02 21.16
## On.board.service 3.06e-01 1.14e-02 26.74
## Leg.room.service 3.62e-01 9.86e-03 36.74
## Baggage.handling 5.58e-02 1.27e-02 4.40
## Checkin.service 2.49e-01 9.39e-03 26.54
## Inflight.service 1.88e-02 1.34e-02 1.41
## Cleanliness 1.10e-01 1.27e-02 8.67
## Arrival.Delay.in.Minutes -3.87e-03 2.83e-04 -13.68
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## Age < 2e-16 ***
## Type.of.Travel < 2e-16 ***
## Class < 2e-16 ***
## Inflight.wifi.service < 2e-16 ***
## Ease.of.Online.booking 1.9e-05 ***
## Online.boarding < 2e-16 ***
## Seat.comfort 0.024 *
## Inflight.entertainment < 2e-16 ***
## On.board.service < 2e-16 ***
## Leg.room.service < 2e-16 ***
## Baggage.handling 1.1e-05 ***
## Checkin.service < 2e-16 ***
## Inflight.service 0.159
## Cleanliness < 2e-16 ***
## Arrival.Delay.in.Minutes < 2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 130562 on 95703 degrees of freedom
## Residual deviance: 54781 on 95688 degrees of freedom
## AIC: 54813
##
## Number of Fisher Scoring iterations: 6
log_predictions <- predict(log_model, newdata = data_test, type = "response")
Observations
Statistical Significance: Most variables have p-values less than 0.05, indicating they significantly influence the dependent variable.
Positive and Negative Relationships: Positive coefficients (e.g., ‘Age’, ‘Type of Travel’) suggest a positive relationship with the outcome, whereas negative coefficients (e.g., ‘Ease of Online booking’, ‘Arrival Delay in Minutes’) indicate a negative relationship.
High Impact Factors: Variables with larger coefficients and small standard errors, like ‘Online boarding’ and ‘Type of Travel’, may have a more substantial impact on the outcome.
Model Fit: The large difference between the null and residual deviance suggests a good model fit.
Non-significant Variables: Some variables, like ‘Inflight service’, do not show statistical significance, implying a weaker or no influence on the dependent variable.
Good Predictive Ability: The model seems capable of predicting the outcome effectively, given the significance and size of most coefficients.
# Converting probabilities to binary classification based on a threshold (e.g., 0.5)
log_pred_class <- ifelse(log_predictions > 0.5, 1, 0)
conf_matrix <- table(data_test$satisfaction, log_pred_class)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
precision <- conf_matrix[2,2] / sum(conf_matrix[,2])  # TP / predicted positives
recall <- conf_matrix[2,2] / sum(conf_matrix[2,])     # TP / actual positives
f_measure <- 2 * precision * recall / (precision + recall)
specificity <- conf_matrix[1,1] / sum(conf_matrix[1,])
log_pred_roc <- pROC::roc(data_test$satisfaction, log_predictions)
auc_value <- pROC::auc(log_pred_roc)
list(accuracy = accuracy, precision = precision, recall = recall,
f_measure = f_measure, specificity = specificity, AUC = auc_value)
## $accuracy
## [1] 0.879
##
## $precision
## [1] 0.859
##
## $recall
## [1] 0.86
##
## $f_measure
## [1] 0.86
##
## $specificity
## [1] 0.894
##
## $AUC
## Area under the curve: 0.948
print(conf_matrix)
## log_pred_class
## 0 1
## 0 12151 1439
## 1 1445 8828
Observations
Model Accuracy (0.879): The accuracy is quite high, at 87.9%. This means that the model correctly predicts whether a customer is satisfied or not in approximately 88 out of 100 cases. It’s a good indicator of overall performance, but it’s important to consider other metrics as well, especially if the data set is imbalanced.
Precision (0.859) and Recall (0.86): Both precision and recall are also high, around 86%. Precision indicates that when the model predicts customer satisfaction, it is correct 85.9% of the time. Recall tells us that the model successfully identifies 86% of actual satisfied customers. These metrics are particularly important in scenarios where the costs of false positives and false negatives are different.
F-Measure (0.86): The F-Measure, which balances precision and recall, is also 0.86. This suggests a good balance between precision and recall in the model, which is crucial for a well-rounded predictive performance.
Specificity (0.894): The specificity is 89.4%, indicating that the model is quite good at identifying true negatives - i.e., it correctly identifies customers who are not satisfied.
Area Under the Curve (AUC) of ROC (0.948): The AUC value is 0.948, which is very close to 1. This high value indicates that the model has an excellent ability to discriminate between satisfied and unsatisfied customers. It implies that the model has a high true positive rate and a low false positive rate.
Overall, the model exhibits strong predictive capabilities across various metrics, indicating that it is well-tuned for this particular task. However, it’s always important to consider the context and the potential impact of misclassifications. Also, examining other aspects like model interpretability, feature importance, and the performance on different segments of the data can provide deeper insights.
vif_model <- vif(log_model)
print(vif_model)
## Age Type.of.Travel
## 1.06 1.42
## Class Inflight.wifi.service
## 1.47 1.85
## Ease.of.Online.booking Online.boarding
## 1.68 1.27
## Seat.comfort Inflight.entertainment
## 1.85 2.41
## On.board.service Leg.room.service
## 1.58 1.18
## Baggage.handling Checkin.service
## 1.71 1.16
## Inflight.service Cleanliness
## 1.87 2.02
## Arrival.Delay.in.Minutes
## 1.02
Observations for VIF
The results are generally good, indicating that for most of the model’s predictors, multicollinearity is not a significant issue.
plot(log_pred_roc,
main = "ROC Curve for Logistic Regression Model",
col = "#1c61b6",
lwd = 2)
auc(log_pred_roc)
## Area under the curve: 0.948
text(0.6, 0.2, paste("AUC =", round(auc(log_pred_roc), 2)), col = "red")
Observations for ROC
The ROC curve displayed is highly indicative of an excellent predictive model, with an AUC (Area Under the Curve) of 0.95, showing exceptional discrimination ability between the positive and negative classes. The curve stays well above the diagonal no-discrimination line, signaling strong performance.
Decision Tree
Libraries Used
rpart, rpart.plot: For constructing and plotting decision tree models.
caret: Provides functions for training and plotting models, and performing cross-validation.
broom: For converting statistical analysis objects into tidy format.
tidyverse: A collection of R packages for data manipulation and visualization.
MASS: Contains functions and datasets supporting Venables and Ripley’s MASS book.
ROCR: For evaluating and visualizing classifier performance.
Data Preparation
Data Type Conversion
- Certain columns in both training and testing datasets are converted to factors to reflect their ordinal nature.
Column Datatype Changes - Testing Data: Conversion of certain columns to factors based on their ordinal nature.
data_test$Inflight.wifi.service = as.factor(data_test$Inflight.wifi.service)
data_test$Departure.Arrival.time.convenient = as.factor(data_test$Departure.Arrival.time.convenient)
data_test$Ease.of.Online.booking = as.factor(data_test$Ease.of.Online.booking)
data_test$Gate.location = as.factor(data_test$Gate.location)
data_test$Food.and.drink = as.factor(data_test$Food.and.drink)
data_test$Online.boarding = as.factor(data_test$Online.boarding)
data_test$Seat.comfort = as.factor(data_test$Seat.comfort)
data_test$Inflight.entertainment = as.factor(data_test$Inflight.entertainment)
data_test$On.board.service = as.factor(data_test$On.board.service)
data_test$Leg.room.service = as.factor(data_test$Leg.room.service)
data_test$Baggage.handling = as.factor(data_test$Baggage.handling)
data_test$Checkin.service = as.factor(data_test$Checkin.service)
data_test$Inflight.service = as.factor(data_test$Inflight.service)
data_test$Cleanliness = as.factor(data_test$Cleanliness)
Column Datatype Changes - Training Data: Similar data type conversions for training data.
#Column datatype Changes - Training Data - As Columns has ordinal its better to convert into factor
data$Inflight.wifi.service = as.factor(data$Inflight.wifi.service)
data$Departure.Arrival.time.convenient = as.factor(data$Departure.Arrival.time.convenient)
data$Ease.of.Online.booking = as.factor(data$Ease.of.Online.booking)
data$Gate.location = as.factor(data$Gate.location)
data$Food.and.drink = as.factor(data$Food.and.drink)
data$Online.boarding = as.factor(data$Online.boarding)
data$Seat.comfort = as.factor(data$Seat.comfort)
data$Inflight.entertainment = as.factor(data$Inflight.entertainment)
data$On.board.service = as.factor(data$On.board.service)
data$Leg.room.service = as.factor(data$Leg.room.service)
data$Baggage.handling = as.factor(data$Baggage.handling)
data$Checkin.service = as.factor(data$Checkin.service)
data$Inflight.service = as.factor(data$Inflight.service)
data$Cleanliness = as.factor(data$Cleanliness)
Decision Tree Model Building
Initial Model Building: A decision tree (tree) is constructed using various predictors such as customer demographics, service ratings, and flight details.
Variable Importance Analysis: The importance of each variable in the decision tree is evaluated to identify significant predictors.
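The initial tree can be constructed along these lines (a sketch assuming the rpart package; the predictor list matches the variables in the varImp() output below, and minbucket is assumed to match the refined model):

```r
library(rpart)

# Initial decision tree over the full predictor set
tree = rpart(satisfaction ~ Gender + Customer.Type + Age + Type.of.Travel +
               Class + Flight.Distance + Inflight.wifi.service +
               Departure.Arrival.time.convenient + Ease.of.Online.booking +
               Gate.location + Food.and.drink + Online.boarding + Seat.comfort +
               Inflight.entertainment + On.board.service + Leg.room.service +
               Baggage.handling + Checkin.service + Inflight.service +
               Cleanliness + Departure.Delay.in.Minutes + Arrival.Delay.in.Minutes,
             data = data, method = 'class', minbucket = 25)
```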
This analysis helps in understanding which variables (predictors) are most influential in determining the target variable, in this case the satisfaction of airline passengers. Let’s analyze the importance of each variable:
Class (17608): The most influential predictor in the model. The class of the flight (e.g., Business, Eco, Eco Plus) seems to be a critical factor in determining passenger satisfaction. This implies that the service level or amenities associated with different flight classes significantly impact how passengers perceive their experience.
Type of Travel (17087): Another highly significant factor. Whether the travel is for business purposes or personal reasons can greatly affect satisfaction, possibly due to differing expectations or needs of travelers.
Online Boarding (16997): The ease and efficiency of the online boarding process play a crucial role in overall satisfaction. This suggests that a smooth and user-friendly online boarding experience is vital for passenger contentment.
Inflight Entertainment (13115) and Inflight Wifi Service (12888): Both are very influential, indicating that onboard entertainment and connectivity are important to passengers, possibly for comfort and staying connected during the flight.
Age (168): While this has some importance, it is significantly less influential compared to other factors. It suggests that age does play a role in satisfaction, but it’s not as critical as the service factors.
Leg Room Service (4009) and On Board Service (2179): These factors are moderately important. Comfort in terms of legroom and the quality of service provided on board are relevant but not as decisive as the class or type of travel.
Ease of Online Booking (1671): This has a moderate influence on satisfaction, indicating the importance of the booking process’s convenience.
Arrival Delay in Minutes (131): Delay in arrival seems to have a lower influence on satisfaction compared to other variables. However, it still matters, as longer delays might lead to increased dissatisfaction.
Variables with Zero Importance: Several variables, such as Gender, Customer Type, and Flight Distance, show zero importance in this model. This suggests that these factors do not significantly contribute to the model’s ability to predict passenger satisfaction in this dataset.
Summary of Analysis
- The class of travel and type of travel are the most influential factors in determining passenger satisfaction, indicating the importance of service level and travel purpose.
- Online and inflight services (boarding, entertainment, wifi) are also crucial, emphasizing the importance of digital experience and onboard comfort.
- Personal factors like Age have some influence but are overshadowed by service and experience-related factors.
- Several variables have no discernible impact on satisfaction in this model, suggesting that they might not be critical in the context of this specific dataset or the way the model was constructed.
This analysis provides valuable insights into what factors airlines should focus on to improve passenger satisfaction, particularly emphasizing service quality, both digital and onboard.
#Analyzing the Importance of variable using the Variable Importance Plot
varImp(tree)
## Overall
## Age 168
## Arrival.Delay.in.Minutes 131
## Class 17608
## Ease.of.Online.booking 1671
## Inflight.entertainment 13115
## Inflight.wifi.service 12888
## Leg.room.service 4009
## On.board.service 2179
## Online.boarding 16997
## Type.of.Travel 17087
## Gender 0
## Customer.Type 0
## Flight.Distance 0
## Departure.Arrival.time.convenient 0
## Gate.location 0
## Food.and.drink 0
## Seat.comfort 0
## Baggage.handling 0
## Checkin.service 0
## Inflight.service 0
## Cleanliness 0
## Departure.Delay.in.Minutes 0
- Refined Model: A second decision tree (tree1) is built focusing only on the significant variables identified earlier.
#Re-running the model with significant variables
tree1 = rpart(satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
Ease.of.Online.booking + Online.boarding + Seat.comfort +
Inflight.entertainment + On.board.service + Leg.room.service +
Baggage.handling + Checkin.service + Inflight.service +
Cleanliness + Arrival.Delay.in.Minutes,data = data,
method = 'class', minbucket=25)
- Decision Tree Visualization: The structure of the refined decision tree is visualized using prp.
#Visualizaing the Decision tree
prp(tree1)
The decision tree shows a simplified model of how different factors contribute to the outcome of passenger satisfaction, which seems to be categorized as either “satisfie” (satisfied) or “neutral.” Here’s a breakdown of the tree:
- Online Boarding (Online.b):
- The first decision node is based on the “Online Boarding” variable.
- If a passenger’s online boarding rating is 1, 2, or 3 (which we can assume to be low to medium satisfaction), we proceed to the left branch.
- Inflight Entertainment (Inflight):
- For passengers who rated online boarding as 1, 2, or 3, the next deciding factor is the “Inflight Entertainment” rating.
- If Inflight Entertainment is rated 1, 2, or 3, the model predicts the passenger will be “neutral” (neither satisfied nor dissatisfied).
- However, if the Inflight Entertainment is not 1, 2, or 3 (which by exclusion would mean a high rating of 4 or 5), the passenger is predicted to be “satisfie” (satisfied).
- Type of Travel (Type.of):
- Returning to the first decision node, if a passenger does not rate online boarding as 1, 2, or 3 (i.e., gives a high rating of 4 or 5), the right branch is followed, suggesting a higher baseline likelihood of satisfaction.
- The decision node then checks whether the Type of Travel is "PrT" (a truncated label that likely stands for "Personal Travel").
- If the Type of Travel is Personal Travel, another check on Inflight Entertainment is made:
- If Inflight Entertainment is rated 1, 2, 3, or 4, the model predicts “neutral” satisfaction.
- If it is not within 1 to 4 (meaning a rating of 5), the model predicts "satisfied."
- Predicted Outcomes:
- The tree has three endpoints that predict "neutral" and two endpoints that predict "satisfied."
Interpretation and Implications:
- Online Boarding is a significant determinant of initial satisfaction. A better online boarding experience leads directly to a higher chance of satisfaction, bypassing other factors.
- Inflight Entertainment is the second most crucial factor; however, its impact is nuanced by the previous experience with online boarding.
- Type of Travel being personal indicates a more significant expectation or reliance on Inflight Entertainment for satisfaction.
- It's worth noting that the tree distinguishes only "satisfied" from "neutral," implying that dissatisfaction is grouped with neutrality in this analysis (consistent with a combined "neutral or dissatisfied" class), or that dissatisfaction was not a separate outcome in the training data.
Based on this tree, to improve overall passenger satisfaction, an airline should focus on enhancing the online boarding process and the quality of inflight entertainment, especially for those traveling for personal reasons.
The tree simplifies the prediction of satisfaction and does not account for all the nuances or interactions between different factors but provides a quick and interpretable way to understand key drivers of satisfaction.
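The claim that online boarding and inflight entertainment are the key drivers can be checked directly: rpart records a split-based variable importance score on the fitted object. Below is a minimal, self-contained sketch on R's built-in iris data (since the survey data is not reproduced here); the same `$variable.importance` access applies to tree1 above.

```r
library(rpart)

# Fit a small classification tree on built-in data (a stand-in for tree1)
fit <- rpart(Species ~ ., data = iris, method = "class")

# Named numeric vector; larger values indicate more influential splitters
sort(fit$variable.importance, decreasing = TRUE)
```

Ranking this vector gives a quick numeric confirmation of which predictors dominate the splits, complementing the visual read of the prp plot.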
Model Tuning and Evaluation
- Cross-Validation Setup: A 10-fold cross-validation is defined for tuning the complexity parameter (cp).
# Define cross-validation experiment
numFolds = trainControl( method = "cv", number = 10 )
cpGrid = expand.grid( .cp = seq(0.01,0.5,0.01))
- trainControl() Function: This function is from the caret package and specifies the method and number of folds for cross-validation. Here, 10-fold cross-validation is defined: the dataset is divided into 10 parts, and during each iteration 9 parts are used for training and 1 part for validation.
- expand.grid() Function: This creates a data frame from all combinations of the factors provided. Here it builds a grid of complexity parameter (cp) values from 0.01 to 0.5, in increments of 0.01, which will be used to tune the decision tree.
- Cross-Validation Execution: The model is trained across a range of cp values to find the optimal model.
# Perform the cross validation
train(satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
Ease.of.Online.booking + Online.boarding + Seat.comfort +
Inflight.entertainment + On.board.service + Leg.room.service +
Baggage.handling + Checkin.service + Inflight.service +
Cleanliness + Arrival.Delay.in.Minutes,
data = data_test, method = "rpart", trControl = numFolds, tuneGrid = cpGrid )
## CART
##
## 23863 samples
## 15 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 21476, 21477, 21477, 21476, 21477, 21477, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.01 0.281 0.678 0.157
## 0.02 0.312 0.603 0.194
## 0.03 0.313 0.600 0.196
## 0.04 0.313 0.600 0.196
## 0.05 0.313 0.600 0.196
## 0.06 0.328 0.561 0.214
## 0.07 0.336 0.540 0.225
## 0.08 0.336 0.540 0.225
## 0.09 0.391 0.377 0.306
## 0.10 0.391 0.377 0.306
## 0.11 0.391 0.377 0.306
## 0.12 0.391 0.377 0.306
## 0.13 0.428 0.255 0.366
## 0.14 0.428 0.255 0.366
## 0.15 0.428 0.255 0.366
## 0.16 0.428 0.255 0.366
## 0.17 0.428 0.255 0.366
## 0.18 0.428 0.255 0.366
## 0.19 0.428 0.255 0.366
## 0.20 0.428 0.255 0.366
## 0.21 0.428 0.255 0.366
## 0.22 0.428 0.255 0.366
## 0.23 0.428 0.255 0.366
## 0.24 0.428 0.255 0.366
## 0.25 0.428 0.255 0.366
## 0.26 0.495 NaN 0.490
## 0.27 0.495 NaN 0.490
## 0.28 0.495 NaN 0.490
## 0.29 0.495 NaN 0.490
## 0.30 0.495 NaN 0.490
## 0.31 0.495 NaN 0.490
## 0.32 0.495 NaN 0.490
## 0.33 0.495 NaN 0.490
## 0.34 0.495 NaN 0.490
## 0.35 0.495 NaN 0.490
## 0.36 0.495 NaN 0.490
## 0.37 0.495 NaN 0.490
## 0.38 0.495 NaN 0.490
## 0.39 0.495 NaN 0.490
## 0.40 0.495 NaN 0.490
## 0.41 0.495 NaN 0.490
## 0.42 0.495 NaN 0.490
## 0.43 0.495 NaN 0.490
## 0.44 0.495 NaN 0.490
## 0.45 0.495 NaN 0.490
## 0.46 0.495 NaN 0.490
## 0.47 0.495 NaN 0.490
## 0.48 0.495 NaN 0.490
## 0.49 0.495 NaN 0.490
## 0.50 0.495 NaN 0.490
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was cp = 0.01.
- train() Function: This function, also from the caret package, trains the machine learning model. Here it trains a decision tree model (the rpart method) with the predictors listed.
- data = data_test: This appears to be an error, since you would typically train on the training dataset rather than the test dataset. Verify whether data_test is indeed the correct dataset to use here.
- trControl = numFolds: The cross-validation settings defined earlier are applied here.
- tuneGrid = cpGrid: The function trains decision trees across the grid of complexity parameter (cp) values to find the best-performing one.
The output shows the performance of the decision tree model across different values of the complexity parameter (cp). Here are the key points from the results:
- RMSE, Rsquared, and MAE: Because the satisfaction outcome is coded numerically here, caret reports regression metrics rather than classification accuracy: root mean squared error, R-squared, and mean absolute error. Lower RMSE and MAE, and higher Rsquared, indicate a better fit.
- Optimal cp Value: The model with cp = 0.01 yielded the lowest RMSE (0.281) and the highest Rsquared (0.678), the best performance among the tested cp values.
- Performance Decline: As cp increases past 0.01, performance generally worsens, suggesting that a more complex tree (lower cp) is preferred up to a point. Beyond cp = 0.25, Rsquared becomes NaN and RMSE plateaus at 0.495, indicating the tree has collapsed to a single node with no splits and can no longer capture any patterns in the data.
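As an alternative to the caret grid search, rpart itself performs internal cross-validation while fitting and stores the results in `cptable`. A hedged, self-contained sketch on built-in data (object names here are illustrative, not from the original code):

```r
library(rpart)

# rpart runs xval-fold internal cross-validation during fitting
fit <- rpart(Species ~ ., data = iris, method = "class", xval = 10)

# cptable columns: CP, nsplit, rel error, xerror (CV error), xstd
bestCp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = bestCp)   # prune back to the best cross-validated subtree
```

This reaches the same goal (a cp chosen by cross-validation) without a separate tuning grid, at the cost of caret's unified resampling report.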
- Final Model Prediction: A final decision tree (tree2) is trained with the optimal cp value and used to make predictions on the test data.
tree2 = rpart(satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
Ease.of.Online.booking + Online.boarding + Seat.comfort +
Inflight.entertainment + On.board.service + Leg.room.service +
Baggage.handling + Checkin.service + Inflight.service +
Cleanliness + Arrival.Delay.in.Minutes,
data = data_test, method="class", cp = 0.01)
- rpart() Function: This function is used to build a decision tree model. The formula inside the function specifies that ‘satisfaction’ is predicted by various other factors.
- data = data_test: As before, this should likely be the training dataset.
- method=“class”: This indicates that the decision tree is for classification, which is consistent with the binary outcome of ‘satisfaction’.
- cp = 0.01: The complexity parameter is set to 0.01, which is determined to be optimal from the cross-validation results.
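The fit-then-predict workflow described in these bullets can be sketched end to end on built-in data; the split, seed, and variable names below are illustrative, not taken from the original analysis.

```r
library(rpart)

set.seed(42)
idx   <- sample(nrow(iris), 100)          # simple train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]

# method = "class" for classification; cp fixed at the tuned value
fit  <- rpart(Species ~ ., data = train, method = "class", cp = 0.01)
pred <- predict(fit, newdata = test, type = "class")
mean(pred == test$Species)                # held-out accuracy
```

Keeping the fit on the training split and the `predict()` on the held-out split is the pattern the commentary above recommends for tree2.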
Model Performance Analysis
- ROC Curve Plotting: The Receiver Operating Characteristic (ROC) curve is plotted to evaluate the model's true positive rate vs. false positive rate. Note that the predictions below are generated from tree1 rather than the newly tuned tree2; if the tuned model is intended, tree2 should be substituted.
#Predicting the Values on the Test data
PredictROC = predict(tree1, newdata = data_test)
#Plotting the ROC Curve
pred = prediction(PredictROC[,2], data_test$satisfaction)
perf = performance(pred, "tpr", "fpr")
plot(perf, colorize = TRUE, print.cutoffs.at=seq(0,1,by=0.1),text.adj = c(-0.2,1.7))
2. Confusion Matrix and Accuracy: The confusion matrix
is used to calculate the model’s accuracy at an optimal threshold
identified from the ROC curve.
#Confusion Matrix table to find accuracy
#From the ROC Curve, we found 0.7 is the optimum threshold value for Cut-off.
table(data_test$satisfaction, PredictROC[,2] > 0.7)
##
## FALSE TRUE
## 0 12139 1451
## 1 1720 8553
- Performance Metrics Calculation: Key metrics including Accuracy, Sensitivity (Recall), Precision, F-Measure, and Specificity are calculated.
Calculating Accuracy: Detailed accuracy calculation.
#Calculating Accuracy
Accuracy_avg_Tree = (14003+8484)/(14003+8484+570+2919)
Accuracy_avg_Tree
## [1] 0.866
Calculating Sensitivity or Recall Value: Calculating the recall metric.
#Calculating Sensitivity or Recall value
Recall = (8484)/(8484+2919)
Recall
## [1] 0.744
Calculating Precision Value: Computing the precision of the model.
#Calculating Precision Value
Precision = (8484)/(8484+570)
Precision
## [1] 0.937
Calculating F-Measure: Determining the F-Measure for model evaluation.
#Calculating F-Measure
F.measure = (2*Recall*Precision)/(Recall+Precision)
F.measure
## [1] 0.829
Calculating Specificity: Assessing the specificity metric.
#Calculating Specificity
Specificity = (14003)/(14003+570)
Specificity
## [1] 0.961
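The five hand-computed metrics above all derive from four confusion-matrix cells, so they can be computed in one place rather than with repeated hard-coded sums. Note that the cell counts used in the calculations (14003, 8484, 570, 2919) do not match the confusion matrix printed earlier (12139, 1451, 1720, 8553), which may come from a different run or threshold; the sketch below uses the counts the calculations were based on.

```r
# Confusion-matrix cells as used in the calculations above
TN <- 14003; TP <- 8484; FP <- 570; FN <- 2919

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
recall      <- TP / (TP + FN)            # sensitivity
precision   <- TP / (TP + FP)
f_measure   <- 2 * precision * recall / (precision + recall)
specificity <- TN / (TN + FP)

round(c(accuracy, recall, precision, f_measure, specificity), 3)
# 0.866 0.744 0.937 0.829 0.961
```

Deriving every metric from the same four named cells avoids the transcription errors that hard-coded sums invite.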
- AUC-ROC Value: The Area Under the Curve (AUC) for the ROC is computed, providing a single measure of the model’s overall performance.
#Testing Data AUC-ROC(Area Under the Curve - Receiver operator Characteristics) value
AUC = as.numeric(performance(pred, "auc")@y.values)
AUC
## [1] 0.896
Random Forest Model
library(pROC)
library(randomForest)
rf_model <- randomForest(satisfaction ~ Age + Type.of.Travel + Class + Inflight.wifi.service +
Ease.of.Online.booking + Online.boarding + Seat.comfort +
Inflight.entertainment + On.board.service + Leg.room.service +
Baggage.handling + Checkin.service + Inflight.service +
Cleanliness + Arrival.Delay.in.Minutes,
data = data,
ntree = 2,   # note: an unusually small forest; typical values are in the hundreds
importance = TRUE)
rf_predictions <- predict(rf_model, newdata = data_test)
conf_matrix_rf <- table(data_test$satisfaction, rf_predictions)
accuracy_rf <- sum(diag(conf_matrix_rf)) / sum(conf_matrix_rf)
precision_rf <- conf_matrix_rf[2,2] / sum(conf_matrix_rf[,2])  # TP / predicted positives (column sum)
recall_rf <- conf_matrix_rf[2,2] / sum(conf_matrix_rf[2,])     # TP / actual positives (row sum)
f_measure_rf <- 2 * precision_rf * recall_rf / (precision_rf + recall_rf)
specificity_rf <- conf_matrix_rf[1,1] / sum(conf_matrix_rf[1,])
rf_pred_roc <- pROC::roc(as.numeric(data_test$satisfaction), as.numeric(rf_predictions))  # built from hard class labels, not probabilities, so the curve has a single operating point
auc_value_rf <- pROC::auc(rf_pred_roc)
list(accuracy = accuracy_rf, precision = precision_rf, recall = recall_rf,
f_measure = f_measure_rf, specificity = specificity_rf, AUC = auc_value_rf)
## $accuracy
## [1] 0.0128
##
## $precision
## [1] 0
##
## $recall
## [1] 0
##
## $f_measure
## [1] NaN
##
## $specificity
## [1] 0.0224
##
## $AUC
## Area under the curve: 0.986
Observations:
The random forest's reported accuracy (0.0128), precision (0), and recall (0) are implausibly low given its near-perfect AUC (0.986). This pattern suggests that the test labels and the predicted factor labels are coded differently (e.g., 0/1 in data_test versus text labels in the training data), so the confusion-matrix diagonal counts almost no agreements. The very small ensemble size (ntree = 2) also limits the reliability of these estimates.
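If a label-coding mismatch between the 0/1 test outcome and the text-labeled predictions is behind the near-zero accuracy and precision reported above, recoding the test outcome with the training factor's labels would make the confusion-matrix diagonal meaningful again. A toy, self-contained illustration (the label strings are assumed from the dataset, not taken from the code above):

```r
# Actual outcomes stored as 0/1, recoded onto the assumed training labels
actual <- factor(c(0, 1, 1, 0), levels = c(0, 1),
                 labels = c("neutral or dissatisfied", "satisfied"))

# Predictions already on the text scale, given the same level ordering
predicted <- factor(c("neutral or dissatisfied", "satisfied",
                      "satisfied", "satisfied"),
                    levels = levels(actual))

table(actual, predicted)   # the diagonal now counts true agreements (3 of 4)
```

The same recoding applied to data_test$satisfaction before building conf_matrix_rf would let the accuracy, precision, and recall formulas above measure real agreement.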
Conclusion
Citations
Klein, TJ (2020). Airline Passenger Satisfaction. Kaggle. https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction?select=train.csv
Lutz, A., & Lubin, G. (2012). Airlines Have An Insanely Small Profit Margin. Business Insider. https://www.businessinsider.com/airlines-have-a-small-profit-margin-2012-6
Hardee, H. (2023). Frontier reports lacklustre Q3 results as it struggles in ‘over-saturated’ core markets. FlightGlobal. https://www.flightglobal.com/strategy/frontier-reports-lacklustre-q3-results-as-it-struggles-in-over-saturated-core-markets/155561.article
vif: Variance Inflation Factors. (n.d.). R Package Documentation. https://rdrr.io/cran/car/man/vif.html
Allison, P. (2015, April 1). What’s So Special About Logit?. Statistical Horizons. https://statisticalhorizons.com/whats-so-special-about-logit/
Assumptions of Logistic Regression. (n.d.). Statistics Solutions. https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/assumptions-of-logistic-regression/
Agarwal, P. (2019, July 8). WHAT and WHY of Log Odds. Towards Data Science. https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704